PCA (Principal Component Analysis) is an unsupervised machine learning algorithm used to identify a set of uncorrelated principal components. This technique can reveal strong patterns and sources of variance that are otherwise difficult to see. Explore the tabs below to see how PCA can be used with different datasets.
When working with massive datasets with thousands of columns, it can be useful to transform the data into a small number of weighted (principal) components that explain the majority of the variation in the data. Reasons to use PCA include reducing dimensionality, removing strongly correlated columns, and decreasing computation time in later analysis.
Think of PCA as a change of coordinates for each point. Shown below is a subset of a dataset.
datatable(head(exampleDF), style = 'bootstrap', options = list(dom = 't'))
Our goal is to transform this data into linear combinations of the principal components. These new components give us a new coordinate system, which is useful for dimension reduction. Imagine our dataset, \(A\), as an \(N \times d\) matrix that we wish to transform into an \(N \times m\) matrix: \[ A \in \mathbb{R}^{N \times d} \Longrightarrow A \in \mathbb{R}^{N \times m} \] By reducing \(d\) columns to \(m\) columns we eliminate columns that are strong linear combinations of one another. To do this we must find the covariance matrix of the data. Before computing the covariance matrix we need to standardize the data; let \(X\) be the standardized version of \(A\). From \(X\), compute the \(d \times d\) covariance matrix \(C\):
\[ \begin{align} C &= \frac{1}{N - 1}X^{T}X \\ C_{i,j} &= \frac{1}{N - 1}\sum^N_{z=1}X_{z,i} X_{z,j} \end{align} \] The covariance matrix \(C\) has the variance of variable \(i\) on the diagonal, \(C_{i,i}\), and the covariance between variables \(i\) and \(j\) in \(C_{i,j} = C_{j,i}\). Next we calculate the eigenvalues \(\lambda_i\) of the covariance matrix, which are the solutions to the equation \[ \det(C - \lambda I) = 0 \] Once we have the eigenvalues, we select the \(m\) eigenvectors that correspond to the \(m\) largest eigenvalues: \[ \{\lambda_1, \lambda_2, \cdots, \lambda_m\} \qquad \lambda_i \geq \lambda_j \quad \forall \; i < j \] To decide how many eigenvectors to keep and how many to drop, we want the smallest set of components that explains the most variance in the data. For each eigenvalue, divide its value by the sum of all the eigenvalues to get the proportion of variance it explains.
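The steps above can be sketched in base R. The matrix here is random, purely a stand-in for real data, and all names are illustrative:

```r
# A minimal sketch of the covariance/eigen steps above, using a random
# matrix as a stand-in for the data (all names here are illustrative).
set.seed(1)
A <- matrix(rnorm(50 * 4), nrow = 50)        # N = 50 rows, d = 4 columns
X <- scale(A, center = TRUE, scale = FALSE)  # center each column
C <- t(X) %*% X / (nrow(X) - 1)              # d x d covariance matrix
# the built-in cov() produces the same matrix
all.equal(C, cov(X), check.attributes = FALSE)
ev <- eigen(C)
ev$values                                    # eigenvalues, largest first
```

Note that `eigen()` already returns the eigenvalues in decreasing order, so no extra sorting is needed.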
Let's visualize the data (centered so \(\mu = 0\)) with respect to 3 dimensions. Try to see where the axis of most variation is.
library(plotly)
# centering with 'scale()'
center_scale <- function(x) {
  scale(x, scale = FALSE)
}
#converting to a dataframe
tempDF <- center_scale(irisDF[c(gx, gy, gz)])
tempDF <- data.frame(tempDF)
#plotting in 3 dimensions (x, y, z)
p <- plot_ly(tempDF, x = px, y = py, z = pz, marker = list(size = 5), height = 600) %>%
  add_markers() %>%
  #specifying the layout of the graph and default view
  layout(scene = list(xaxis = list(title = 'x'),
                      yaxis = list(title = 'y'),
                      zaxis = list(title = 'z')))
p
We need to find which eigenvectors of the covariance matrix explain the variation in the data. After subtracting the column mean \(\mu\) from each point (centering the data), we can compute the covariance matrix. Here are the eigenvalues and eigenvectors for the subset of Iris data.
# Finding the covariance matrix of our data (leaving out the 'Species' column)
covdf <- cov(centeredIris[,1:3])
# Finding the eigenvalues and eigenvectors
eigenInfo <- eigen(covdf)
# Eigen-Values
eigenInfo$values
## [1] 3.69638110 0.15685442 0.03402802
# Eigen-Vectors
eigenInfo$vectors
## [,1] [,2] [,3]
## [1,] 0.38539668 0.15645943 0.9093898
## [2,] -0.09961563 0.98681521 -0.1275636
## [3,] 0.91735823 0.04142686 -0.3959011
Observe the eigenvalues that we found. We can find the total variance, \(T\), as the sum of the eigenvalues. \[
\begin{align}
T &= \sum_{i = 1}^n \lambda_i \\
&= \lambda_1 + \lambda_2 + \lambda_3 \\
&= 3.69638110 + 0.15685442 + 0.03402802 \\
&= 3.88726354
\end{align}
\] To determine the variation explained by each principal component we simply divide \(\lambda_i\) by \(T\). Computing the explained variation for each eigenvalue yields the following: \[
var(\lambda_1) = \frac{3.6964}{3.8873} = 0.9509 \qquad var(\lambda_2) = \frac{0.1569}{3.8873} = 0.0404 \qquad var(\lambda_3) = \frac{0.0340}{3.8873} = 0.0088
\] The principal component of \(\lambda_1\) explains 95.1% of the variance of the data, while \(\lambda_2\) and \(\lambda_3\) explain 4.0% and 0.9% respectively. This means that by using principal component 1 and principal component 2 we can explain 99.1% of the variation of the data.
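The arithmetic above can be checked in R directly from the eigenvalues reported earlier:

```r
# Explained variance: each eigenvalue divided by the total variance T
evals <- c(3.69638110, 0.15685442, 0.03402802)
sum(evals)                    # T = 3.88726354
round(evals / sum(evals), 4)  # 0.9509 0.0404 0.0088
```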
Let’s plot the eigenvectors alongside the data and see if our PCA components agree.
From the graph you can see that the first eigenvector explains most of the variation in the data, while the other two eigenvectors add far less information. Let's transform the data to use PC1 and PC2 as the two axes. To do this we multiply our centered data matrix by \(P\), the matrix whose columns are the eigenvectors sorted by the magnitude of their eigenvalues. The product, which we call \(X^*\), is our data transformed along the eigenvectors.
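As a sketch of the projection \(X^* = XP\) in base R (the data here is made up, since the centered Iris matrix lives in the app's environment):

```r
# Projecting centered data onto the top m eigenvectors: X* = X P
set.seed(2)
centered <- scale(matrix(rnorm(30 * 3), ncol = 3), scale = FALSE)
ev <- eigen(cov(centered))
P <- ev$vectors[, 1:2]    # eigenvectors as COLUMNS, sorted by eigenvalue
Xstar <- centered %*% P   # 30 x 2: the data in PC1/PC2 coordinates
# the variance along each new axis equals the matching eigenvalue
apply(Xstar, 2, var)
```

This last line is a useful sanity check: the sample variance of each projected column should reproduce the corresponding eigenvalue of the covariance matrix.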
library("FactoMineR")
pca <- PCA(iris[,1:3], scale.unit=TRUE, graph=FALSE)
scores <- data.frame(pca$ind$coord)
#scores$Species = iris[,c("Species")]
p <- plot_ly(scores, x = ~Dim.1, y = ~Dim.2, marker = list(size = 8), height = 600) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'PC1'),
                      yaxis = list(title = 'PC2')))
p
It may look like we dropped an entire column of data from the dataset, but 98.5% of the variation in the original data is still there for analysis. By transforming the data into linear combinations of the principal components, we can discard components that only explain a small percentage of variation in the data. This is useful for large data since it allows for dimensionality reduction and, in turn, a decrease in computation time.
We can use this reduced data for further analysis. We wish to try and classify this data into definitive groups. One such machine learning algorithm is known as K-Means. We can use this algorithm to assign each point a cluster, and then further analyze the similarities in each cluster.
library("FactoMineR")
pca <- PCA(iris[,1:3], scale.unit=TRUE, graph=FALSE)
scores <- data.frame(pca$ind$coord)
k <- kmeans(scores, centers = 3)
scores$Cluster = k$cluster
p <- plot_ly(scores, x = ~Dim.1, y = ~Dim.2, color = ~Cluster, marker = list(size = 8), height = 600) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'PC1'),
                      yaxis = list(title = 'PC2'))) %>%
  hide_colorbar()
p
library("FactoMineR")
pca <- PCA(iris[,1:3], scale.unit=TRUE, graph=FALSE)
scores <- data.frame(pca$ind$coord)
k <- kmeans(scores, centers = 3)
scores$Cluster = k$cluster
scores$Species = as.numeric(factor(iris$Species))
p <- plot_ly(scores, x = ~Dim.1, y = ~Dim.2, color = ~Cluster, marker = list(symbol = ~Species, size = 8), height = 600) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'PC1'),
                      yaxis = list(title = 'PC2')))
p
Oftentimes the information stored within the data is hard to see at first glance. The dataset we are interested in contains the average consumption of 17 types of food, in grams per person per week, for four countries.
datatable(head(ukdf2), style = 'bootstrap', options = list(dom = 't'))
p <- ggplot(ukdf, aes(factor(Category), Amount, fill = Country)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_brewer(palette = "Set1") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplotly(p)
#using the same dataset mutated so that the columns are the countries
rotated <- read.csv("uk2.csv")
#transpose of the data set for PCA, converted back to a data frame
rotated <- data.frame(t(rotated[c(1:17), c(2:5)]))
#function to center the data
center_scale <- function(x) {
  scale(x, scale = FALSE)
}
#apply the center_scale function
rotated <- data.frame(center_scale(rotated[c(colnames(rotated))]))
#computing the covariance matrix
coMat <- cov(rotated)
datatable(coMat[1:5,1:5], style = 'bootstrap', options = list(dom = 't'))
\[ \Large \text{Eigenvalues} \]
#getting the eigen information
e <- eigen(coMat)
#saving the values as a dataframe for easier viewing
edf <- data.frame(e$values)
#naming the rows
rownames(edf) <- seq(1,17,1)
#rendering the table
edf[1:6,] %>%
  kable(col.names = "Magnitude") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
| Magnitude |
|---|
| 108504.971 |
| 45993.722 |
| 7145.557 |
| 0.000 |
| 0.000 |
| 0.000 |
\[ \Large \text{Eigenvectors} \]
#saving the vectors as a dataframe for easier viewing
evdf <- data.frame(e$vectors)
#naming the rows and columns
colnames(evdf) <- c("EV1","EV2","EV3","EV4","EV5","EV6","EV7","EV8","EV9","EV10","EV11","EV12","EV13","EV14","EV15","EV16","EV17")
rownames(evdf) <- seq(1,17,1)
#rendering the table
evdf[,1:6] %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
| EV1 | EV2 | EV3 | EV4 | EV5 | EV6 |
|---|---|---|---|---|---|
| -0.4558165 | 0.1363311 | -0.4029357 | 0.5945762 | 0.0000000 | 0.0000000 |
| -0.0268097 | -0.0290869 | -0.0307685 | 0.6120113 | 0.0268932 | 0.0503393 |
| 0.0478535 | 0.0114693 | 0.0498089 | 0.0905221 | -0.1652756 | -0.0926986 |
| -0.0547470 | -0.2119961 | -0.2886565 | -0.1504401 | 0.7067578 | 0.3490614 |
| -0.0554048 | 0.0198675 | 0.0246793 | -0.0234021 | -0.0307836 | 0.1446307 |
| -0.0292834 | 0.0071605 | -0.0434882 | -0.0344617 | 0.0018091 | 0.0497705 |
| -0.0084246 | -0.0957200 | -0.0997785 | -0.0163238 | 0.1041396 | -0.3789877 |
| -0.0841219 | -0.0442831 | 0.0467818 | 0.0006954 | -0.3433819 | -0.0229229 |
| -0.6243112 | -0.1298216 | 0.4250752 | -0.0236682 | 0.1868294 | -0.0309901 |
| 0.3738770 | -0.7368953 | -0.1498219 | 0.2552870 | -0.0471391 | -0.0218230 |
| -0.1519486 | -0.1309302 | 0.2145775 | 0.0250026 | -0.1281214 | 0.0193053 |
| -0.3159671 | -0.1212518 | -0.6556202 | -0.3871953 | -0.3511945 | -0.1180323 |
| -0.2458587 | -0.2081178 | -0.0024745 | -0.1389245 | 0.1729363 | -0.1359751 |
| -0.0256868 | 0.0432584 | -0.0658729 | -0.0444859 | -0.1276670 | 0.6168411 |
| -0.0367924 | -0.0419739 | 0.0538921 | -0.0010112 | -0.2744130 | 0.5265419 |
| 0.2423788 | 0.5325557 | -0.2222849 | 0.0306888 | 0.1971996 | -0.0485380 |
| -0.0383551 | -0.0406179 | -0.0238925 | -0.0243678 | -0.0739668 | 0.0717726 |
Now that we have all of the eigen-information, we can decide which components to drop. The magnitude of each eigenvalue represents the variance explained by the corresponding principal component. This means we need to divide each eigenvalue by the sum of all eigenvalues in order to find the percentage of variation explained by each component.
#function to divide by total variance
div <- function(x) {
  x / totV
}
#finding the total variance (sum of the eigenvalues)
totV <- sum(e$values)
percent <- data.frame(sapply(e$values, div))
rownames(percent) <- c("EV1","EV2","EV3","EV4","EV5","EV6","EV7","EV8","EV9","EV10","EV11","EV12","EV13","EV14","EV15","EV16","EV17")
percent %>%
  kable(col.names = "Percentage Explained") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
|  | Percentage Explained |
|---|---|
| EV1 | 0.6712578 |
| EV2 | 0.2845367 |
| EV3 | 0.0442055 |
| EV4 | 0.0000000 |
| EV5 | 0.0000000 |
| EV6 | 0.0000000 |
| EV7 | 0.0000000 |
| EV8 | 0.0000000 |
| EV9 | 0.0000000 |
| EV10 | 0.0000000 |
| EV11 | 0.0000000 |
| EV12 | 0.0000000 |
| EV13 | 0.0000000 |
| EV14 | 0.0000000 |
| EV15 | 0.0000000 |
| EV16 | 0.0000000 |
| EV17 | 0.0000000 |
From this calculation we can see that the first 3 eigenvalues account for essentially 100% of the variation in the data, which means we can drop the other 14 components. A common rule of thumb (the Kaiser criterion) is to drop components whose eigenvalues are less than 1. After dropping those components we are left with the following eigenvalues and eigenvectors.
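Component selection can also be automated. Here is a sketch using the eigenvalues computed above, keeping just enough components to explain 99% of the variance (the 0.99 threshold is a common but arbitrary choice):

```r
# How many components are needed to explain 99% of the variance?
evals <- c(108504.971, 45993.722, 7145.557, rep(0, 14))
prop <- evals / sum(evals)           # proportion explained per component
m <- which(cumsum(prop) >= 0.99)[1]  # smallest m whose cumulative sum passes 0.99
m
```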
\[ \Large \text{Usable Principal Components} \]
#the first three usable eigenvectors (the first three COLUMNS of e$vectors)
eVecs <- data.frame(e$vectors[, 1:3])
eVecs %>%
  kable(col.names = c("EV1", "EV2", "EV3")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
| EV1 | EV2 | EV3 |
|---|---|---|
| -0.4558165 | 0.1363311 | -0.4029357 |
| -0.0268097 | -0.0290869 | -0.0307685 |
| 0.0478535 | 0.0114693 | 0.0498089 |
| -0.0547470 | -0.2119961 | -0.2886565 |
| -0.0554048 | 0.0198675 | 0.0246793 |
| -0.0292834 | 0.0071605 | -0.0434882 |
| -0.0084246 | -0.0957200 | -0.0997785 |
| -0.0841219 | -0.0442831 | 0.0467818 |
| -0.6243112 | -0.1298216 | 0.4250752 |
| 0.3738770 | -0.7368953 | -0.1498219 |
| -0.1519486 | -0.1309302 | 0.2145775 |
| -0.3159671 | -0.1212518 | -0.6556202 |
| -0.2458587 | -0.2081178 | -0.0024745 |
| -0.0256868 | 0.0432584 | -0.0658729 |
| -0.0367924 | -0.0419739 | 0.0538921 |
| 0.2423788 | 0.5325557 | -0.2222849 |
| -0.0383551 | -0.0406179 | -0.0238925 |
Recall that the first component we found contains 67.1% of the variation in the data, and the second contains 28.5%. With these two components we preserve 95.6% of the useful information in the data, without the other 15 columns. Let's first plot it in one dimension to see if anything jumps out.
#translating the original coordinates to the new principal component coordinates
#getting the first eigenvector (the first column of e$vectors)
dimOne <- data.matrix(e$vectors[, 1])
#converting rotated to a matrix for multiplication
matrixData <- data.matrix(rotated)
#project the data onto the eigenvector
newCoords <- data.frame(matrixData %*% dimOne)
#adding column names
colnames(newCoords) <- c("Dim.1")
datatable(newCoords, style = 'bootstrap', options = list(dom = 't'))
#adding a dummy y-coordinate so we can plot it
newCoords$Y <- rep(0,4)
#naming the columns and rows
colnames(newCoords) <- c("Dim.1", "Y")
#plotting with the new coords
p <- plot_ly(newCoords, x = ~Dim.1, y = ~Y, text = rownames(newCoords), mode = 'markers', height = 600) %>%
  add_markers(marker = list(size = 8)) %>%
  add_text(textposition = "top left") %>%
  layout(xaxis = list(title = 'PC1', range = c(-100, 120)),
         yaxis = list(title = 'Y', range = c(-.1, .1)))
p
#translating the original coordinates to the new principal component coordinates
#getting the first two eigenvectors (the first two columns of e$vectors)
dimTwo <- data.matrix(e$vectors[, 1:2])
#converting rotated to a matrix for multiplication
matrixData <- data.matrix(rotated)
#project the data onto the eigenvectors
newCoords <- data.frame(matrixData %*% dimTwo)
#naming the columns and rows
colnames(newCoords) <- c("Dim.1", "Dim.2")
datatable(newCoords, style = 'bootstrap', options = list(dom = 't'))
#plotting with the new coords
p <- plot_ly(newCoords, x = ~Dim.1, y = ~Dim.2, text = rownames(newCoords), mode = 'markers', height = 600) %>%
  add_markers(marker = list(size = 8)) %>%
  add_text(textposition = "top") %>%
  layout(xaxis = list(title = 'PC1', range = c(-90, 120)),
         yaxis = list(title = 'PC2'))
p
#translating the original coordinates to the new principal component coordinates
#getting the first three eigenvectors (the first three columns of e$vectors)
dimThree <- data.matrix(e$vectors[, 1:3])
#converting rotated to a matrix for multiplication
matrixData <- data.matrix(rotated)
#project the data onto the eigenvectors
newCoords <- data.frame(matrixData %*% dimThree)
#naming the columns and rows
colnames(newCoords) <- c("Dim.1", "Dim.2", "Dim.3")
#plotting with the new coords
p <- plot_ly(newCoords, x = ~Dim.1, y = ~Dim.2, z = ~Dim.3, text = rownames(newCoords), mode = 'markers', height = 600) %>%
  add_markers(marker = list(size = 8)) %>%
  add_text(textposition = "top") %>%
  layout(scene = list(xaxis = list(title = 'PC1'),
                      yaxis = list(title = 'PC2'),
                      zaxis = list(title = 'PC3')))
p
PCA <- prcomp(rotated, scale = FALSE)
rownames(PCA$rotation) <- ukdf2$Category
datatable(PCA$rotation[,1:3], style = 'bootstrap', options = list(dom = 't', pageLength = 17))
library(ggbiplot)
ggbiplot(PCA, cex = 0.5)
This biplot shows the projections of each categorical variable onto the first two principal components. We can see Fresh Potatoes and Soft Drinks standing out from the rest of the categories; these two categories appear to push Northern Ireland and Scotland further away. Let's plot another grouped barchart for those two categories.
#keep only the rows for Fresh Potatoes and Soft Drinks
df <- ukdf[(ukdf$Category == "Fresh Potatoes" | ukdf$Category == "Soft Drinks"),]
p <- ggplot(df, aes(factor(Category), Amount, fill = Country)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_brewer(palette = "Set1") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplotly(p)
PCA can be used in many different applications, even text analysis! We are using the dataset 'allWords.csv', which contains word data scraped from the CNNHealth and FoxHealth Twitter accounts. Take a look at this dataset below.
# Read in the data
dfwords <- read.csv("allWords.csv")
# Display the data
datatable(head(dfwords), style = 'bootstrap', options = list(dom = 't'))
#we are using the libraries "tidytext" and "tidyverse"
# Converting the dataframe into a tibble for easy manipulation
allWords <- as_tibble(dfwords)
#Starting with a quick visualization to see total word frequency
ggplot(allWords, aes(n/total, fill = Source)) +
  geom_histogram(show.legend = FALSE, binwidth = 0.1) +
  facet_wrap(~Source, ncol = 2, scales = "free_y") +
  xlab("Term Frequency")
The above plot showcases what term frequency is. TF is the number of times a word appears in a source divided by the total number of terms (words) from that source. From here we can see that Fox tends to have relatively more words with medium term frequency than CNN. The far-right bar most likely corresponds to words like "as", "a", "the", or "and". When we run TF-IDF, the scores for these terms will go to 0 and they will be ignored in the final step. Now we can run TF-IDF and get some results.
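For intuition, here is a toy, hand-rolled version of the tf-idf score on a made-up two-source corpus (the words and counts are invented). tidytext's bind_tf_idf computes the same quantity: tf times idf, with idf defined as the natural log of the number of documents divided by the number of documents containing the word.

```r
# Toy tf-idf: two "documents" (sources) with invented word lists
docs <- list(cnn = c("flu", "flu", "study"), fox = c("flu", "diet"))
tfidf <- function(word, doc) {
  tf  <- sum(docs[[doc]] == word) / length(docs[[doc]])      # term frequency
  df  <- sum(vapply(docs, function(d) word %in% d, logical(1)))
  idf <- log(length(docs) / df)  # ln(n_docs / n_docs containing the word)
  tf * idf
}
tfidf("flu", "cnn")   # appears in every source, so idf = 0 and the score is 0
tfidf("diet", "fox")  # distinctive word: tf = 1/2, idf = ln(2)
```

This is why stop words like "the" get a score of 0: they appear in every document, so their idf term vanishes.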
# Running TF-IDF from the tidytext library
wordScores <- allWords %>%
bind_tf_idf(Word, Source, n)
remove <- c("aug", "nov", "jul", "sep", "sug", "oct",
"jun", "there", "any", "because", "let",
"where", "est", "thats", "youve", "tells", "theyre",
"lot", "sure")
idx <- which(!wordScores$Word %in% remove)
wordScores <- wordScores[c(idx),]
datatable(head(wordScores), style = 'bootstrap', options = list(dom = 't'))
#Sorting and grouping to get the top 20 words
wordScores %>%
  arrange(desc(tf_idf)) %>%
  mutate(Word = factor(Word, levels = rev(unique(Word)))) %>%
  group_by(Source) %>%
  top_n(20) %>%
  ungroup %>%
  ggplot(aes(Word, tf_idf, fill = Source)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~Source, ncol = 2, scales = "free") +
  coord_flip()
## Selecting by tf_idf
From this comparison we can see a drastic difference in the words from the two news sources. Let's set up our data for PCA. Recall that PCA involves the eigen-decomposition of a covariance matrix, so first we need to create a covariance matrix from our existing data.
#For now we will only worry about the tf_idf score
#Drop all rows whose tf_idf score is 0
#apply() returns the indices of the rows where tf_idf is not 0
#c("Word", "tf_idf", "Source") selects the columns to keep
pcaText <- wordScores[
  apply(wordScores["tf_idf"], 1, function(row) all(row != 0)),
  c("Word", "tf_idf", "Source")
]
grouped <- pcaText %>%
  group_by(Source)
grouped
## # A tibble: 4,471 x 3
## # Groups: Source [2]
## Word tf_idf Source
## <fct> <dbl> <fct>
## 1 aaron 0.0000231 cnn
## 2 abandoning 0.0000571 fox
## 3 abbvie 0.0000571 fox
## 4 abdomen 0.000114 fox
## 5 abduction 0.0000463 cnn
## 6 abiltity 0.0000231 cnn
## 7 able 0.000255 cnn
## 8 abort 0.0000231 cnn
## 9 abortion 0.000116 cnn
## 10 abroad 0.0000231 cnn
## # … with 4,461 more rows
cor(grouped[,c("tf_idf")])
## tf_idf
## tf_idf 1
datatable(head(pcaText), style = 'bootstrap', options = list(dom = 't'))